## here() starts at /home/rkisleva/ieo/ieoproject-bronchial-detectives
The severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) is a highly pathogenic human zoonotic coronavirus, which causes Coronavirus disease 2019 (COVID-19). In an effort to understand the host transcriptional response to the SARS-Cov-2 virus, Blanco-Melo et al. (2020) sequenced the transcriptome of two different human cell lines, human alveolar adenocarcinoma cells (A549) and primary human bronchial epithelial (NHBE) cells, after infecting them with SARS-Cov-2, seasonal influenza A virus (IAV) and human respiratory syncytial virus (RSV), and growing them in the same culture conditions without infection (mock).
The resulting raw RNA-seq data have been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession GSE147507. Here, we show a first exploratory analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (https://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a SummarizedExperiment object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.
The directories in this package template follow the structure of an R package (consult this link for further information) adapted for the purpose of data analysis:
R: functions to avoid repeating steps and store auxiliary code, used in the vignette.
vignettes: Rmarkdown document of the data analysis.
inst/extdata: repository of the data to be analyzed and any other kind of additional functional and annotation data employed during the analysis and which is unavailable through an R package. This directory is moved to the package root directory at install.
inst/doc: repository for the results of the analysis that we want to provide without having to run the entire analysis again, e.g., tables of differentially expressed genes. This directory is moved to the package root directory at install, where also the vignettes are installed.
man: manual pages of functions defined in the R directory that for some reason we want to export from the package.
Every other directory you see in this package has been automatically
generated in the package building process as, for instance, the doc
directory with the result of the vignette.
When you edit and adapt this vignette for your own data analysis project,
you should do it in the .Rmd file of the vignettes directory
(not the one copied to the doc directory because this one is overwritten
during package building and if you edit there you will loose your edits).
To build this vignette without building the entire package, you should type
the following in the R shell:
devtools::build_vignettes()
This function call will build your vignette and copy the resulting HTML to
the doc directory. Thus, to see the result, you should go there and open
that HTML file.
The rest of the documentation of this package is provided within the files
of the R directory using
roxygen2,
which means that before you build the entire package you have to generate the
documentation and NAMESPACE file typing in the R shell:
devtools::document()
Both steps, calling devtools::build_vignettes() and devtools::document()
have to be done from the root directory of your package.
We start importing the raw table of counts.
library(SummarizedExperiment)
se <- readRDS(file.path(system.file("extdata",
package="IEOproject"),
"GSE79209.rds"))
se
class: RangedSummarizedExperiment
dim: 25122 81
metadata(4): experimentData annotation ensemblVersion urlProcessedData
assays(1): counts
rownames(25122): 1 10 ... 9994 9997
rowData names(5): gene_id gene_biotype description gene_id_version
symbol
colnames(81): SRR3225255 SRR3225256 ... SRR3225335 SRR3225336
colData names(68): title geo_accession ... total alignments:ch1 unique
alignments:ch1
We have 25122 genes by 81 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run (SRR) identifiers.
The row data in this object contains information about the profiled genes.
head(rowData(se))
DataFrame with 6 rows and 5 columns
gene_id gene_biotype description
<character> <character> <character>
1 ENSG00000121410 protein_coding alpha-1-B glycoprote..
10 ENSG00000156006 protein_coding N-acetyltransferase ..
100 ENSG00000196839 protein_coding adenosine deaminase ..
1000 ENSG00000170558 protein_coding cadherin 2 [Source:H..
10000 ENSG00000117020 protein_coding AKT serine/threonine..
100008586 ENSG00000236362 protein_coding G antigen 12F [Sourc..
gene_id_version symbol
<character> <character>
1 ENSG00000121410.11 A1BG
10 ENSG00000156006.4 NAT2
100 ENSG00000196839.12 ADA
1000 ENSG00000170558.8 CDH2
10000 ENSG00000117020.16 AKT3
100008586 ENSG00000236362.8 GAGE12F
Among this information, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis. Let’s explore now the column (phenotypic) data.
dim(colData(se))
[1] 81 68
head(colData(se), n=3)
DataFrame with 3 rows and 68 columns
title geo_accession status
<character> <character> <character>
SRR3225255 Subject L1, dysplasia GSM2088002 Public on May 25 2017
SRR3225256 Subject L10, dysplasia GSM2088003 Public on May 25 2017
SRR3225257 Subject L11, dysplasia GSM2088004 Public on May 25 2017
submission_date last_update_date type channel_count
<character> <character> <character> <character>
SRR3225255 Mar 14 2016 May 15 2019 SRA 1
SRR3225256 Mar 14 2016 May 15 2019 SRA 1
SRR3225257 Mar 14 2016 May 15 2019 SRA 1
source_name_ch1 organism_ch1 characteristics_ch1
<character> <character> <character>
SRR3225255 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225256 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225257 Bronchial brushing Homo sapiens tissue: Bronchial br..
characteristics_ch1.1 characteristics_ch1.2 characteristics_ch1.3
<character> <character> <character>
SRR3225255 age: 69 Sex: Male smoking status: Form..
SRR3225256 age: 50 Sex: Female smoking status: Curr..
SRR3225257 age: 49 Sex: Male smoking status: Curr..
characteristics_ch1.4 characteristics_ch1.5 characteristics_ch1.6
<character> <character> <character>
SRR3225255 pack years: 58 copd status: No COPD max histology: Mild ..
SRR3225256 pack years: 34 copd status: No COPD max histology: Mild ..
SRR3225257 pack years: 35 copd status: No COPD max histology: Moder..
characteristics_ch1.7 characteristics_ch1.8 characteristics_ch1.9
<character> <character> <character>
SRR3225255 dysplasia status: Dy.. total alignments: 91.. unique alignments: 8..
SRR3225256 dysplasia status: Dy.. total alignments: 96.. unique alignments: 8..
SRR3225257 dysplasia status: Dy.. total alignments: 12.. unique alignments: 1..
characteristics_ch1.10 characteristics_ch1.11 characteristics_ch1.12
<character> <character> <character>
SRR3225255 read 1 alignment: 42.. read 2 alignment: 41.. positive strand alig..
SRR3225256 read 1 alignment: 45.. read 2 alignment: 43.. positive strand alig..
SRR3225257 read 1 alignment: 56.. read 2 alignment: 54.. positive strand alig..
characteristics_ch1.13 characteristics_ch1.14 characteristics_ch1.15
<character> <character> <character>
SRR3225255 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225256 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225257 negative strand alig.. non-splice alignment.. splice alignments: 3..
characteristics_ch1.16 characteristics_ch1.17 characteristics_ch1.18
<character> <character> <character>
SRR3225255 properly paired alig.. genebody 80/20 ratio.. mean gc content: 53.54
SRR3225256 properly paired alig.. genebody 80/20 ratio.. mean gc content: 45.45
SRR3225257 properly paired alig.. genebody 80/20 ratio.. mean gc content: 44.44
molecule_ch1 extract_protocol_ch1 extract_protocol_ch1.1
<character> <character> <character>
SRR3225255 polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225256 polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225257 polyA RNA Bronchial airway bru.. Sequencing libraries..
taxid_ch1 description data_processing
<character> <character> <character>
SRR3225255 9606 Sample_L1 Demultiplexing and c..
SRR3225256 9606 Sample_L10 Demultiplexing and c..
SRR3225257 9606 Sample_L11 Demultiplexing and c..
data_processing.1 data_processing.2 data_processing.3
<character> <character> <character>
SRR3225255 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225256 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225257 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
data_processing.4 data_processing.5 data_processing.6
<character> <character> <character>
SRR3225255 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225256 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225257 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
platform_id data_row_count instrument_model library_selection
<character> <character> <character> <character>
SRR3225255 GPL16791 0 Illumina HiSeq 2500 cDNA
SRR3225256 GPL16791 0 Illumina HiSeq 2500 cDNA
SRR3225257 GPL16791 0 Illumina HiSeq 2500 cDNA
library_source library_strategy relation
<character> <character> <character>
SRR3225255 transcriptomic RNA-Seq SRA: https://www.ncb..
SRR3225256 transcriptomic RNA-Seq SRA: https://www.ncb..
SRR3225257 transcriptomic RNA-Seq SRA: https://www.ncb..
relation.1 supplementary_file_1 age:ch1
<character> <character> <character>
SRR3225255 BioSample: https://w.. NONE 69
SRR3225256 BioSample: https://w.. NONE 50
SRR3225257 BioSample: https://w.. NONE 49
copd status:ch1 dysplasia status:ch1 genebody 80/20 ratio:ch1
<character> <character> <character>
SRR3225255 No COPD Dysplasia 1.34213035203572
SRR3225256 No COPD Dysplasia 1.25508564705863
SRR3225257 No COPD Dysplasia 1.20885950984169
max histology:ch1 mean gc content:ch1
<character> <character>
SRR3225255 Mild dysplasia 53.54
SRR3225256 Mild dysplasia 45.45
SRR3225257 Moderate dysplasia 44.44
negative strand alignments:ch1 non-splice alignments:ch1
<character> <character>
SRR3225255 41653617 62567474
SRR3225256 44222609 64646471
SRR3225257 55046872 79307301
pack years:ch1 positive strand alignments:ch1
<character> <character>
SRR3225255 58 41837114
SRR3225256 34 44366147
SRR3225257 35 55244102
properly paired alignments:ch1 read 1 alignment:ch1
<character> <character>
SRR3225255 66570222 42319429
SRR3225256 70579486 45257546
SRR3225257 87409558 56124363
read 2 alignment:ch1 Sex:ch1 smoking status:ch1
<character> <character> <character>
SRR3225255 41171302 Male Former smoker
SRR3225256 43331210 Female Current smoker
SRR3225257 54166611 Male Current smoker
splice alignments:ch1 tissue:ch1 total alignments:ch1
<character> <character> <character>
SRR3225255 20923257 Bronchial brushing 91000281
SRR3225256 23942285 Bronchial brushing 96810722
SRR3225257 30983673 Bronchial brushing 120861113
unique alignments:ch1
<character>
SRR3225255 83490731
SRR3225256 88588756
SRR3225257 110290974
We have a total of 68 phenotypic variables. The
second column geo_accession contains GEO Sample Accession Number
(GSM) identifers.
GSM identifiers define individual samples, understood in our context as
individual sources of RNA. We can check if we have any technical replicates
as follows:
length(unique(se$geo_accession))
[1] 81
table(lengths(split(colnames(se), se$geo_accession)))
1
81
So, we have 81 different individual samples which matches the number of samples that we have 81. Therefore we don’t have any technical replicates and we can proceed with the next step.
To proceed further exploring this dataset, we are going to use the
edgeR package and build
a DGEList object, incorporating the gene metadata, which includes
the gene symbol.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=rowData(se))
head(dge$samples, 3)
group lib.size norm.factors
SRR3225255 1 35504068 1
SRR3225256 1 39395250 1
SRR3225257 1 49591671 1
dim(dge)
[1] 25122 81
Calculate \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation. This provides a standarized and normalized representation of gene expression data.
assays(se)$logCPM <- cpm(dge, log=TRUE)
assays(se)$logCPM[1:5, 1:5]
SRR3225255 SRR3225256 SRR3225257 SRR3225258 SRR3225259
1 -2.8144890 -2.9017808 -4.1164076 -2.4911549 -3.436999
10 -1.8212210 -2.6511767 -3.6837191 -2.1459416 -1.077656
100 -0.0186939 0.9340788 0.2471703 -0.6559511 1.072922
1000 1.4206565 2.2413119 3.0040881 0.5618081 2.296503
10000 1.3426936 2.2490363 2.1152217 2.2466499 1.900561
Let’s explore now some of the phenotypic variables. After some visual inspection
of the variables, we find out that the variable se$characteristics_ch1.4 contains
information about how many cigarettes a person has smoked in their lifetime.
We also have a variable which tells us whether the person has COPD
(Chronic obstructive pulmonary disease) or not se$characteristics_ch1.5.
table(se$characteristics_ch1.4)
pack years: 108 pack years: 114 pack years: 26.25 pack years: 30
1 1 1 3
pack years: 31.5 pack years: 32 pack years: 32.5 pack years: 33
1 2 2 1
pack years: 33.25 pack years: 34 pack years: 35 pack years: 35.5
1 2 5 1
pack years: 36 pack years: 36.4 pack years: 36.8 pack years: 37
1 1 1 2
pack years: 37.58 pack years: 38 pack years: 39 pack years: 40
1 2 1 5
pack years: 41 pack years: 41.49 pack years: 42 pack years: 42.5
3 1 2 1
pack years: 42.66 pack years: 43.75 pack years: 45 pack years: 46
1 1 2 1
pack years: 46.8 pack years: 47 pack years: 48 pack years: 49.5
1 2 1 2
pack years: 50 pack years: 50.6 pack years: 51 pack years: 51.25
2 2 4 1
pack years: 52 pack years: 53 pack years: 54 pack years: 57
1 3 2 2
pack years: 58 pack years: 58.24 pack years: 60.8 pack years: 65
2 1 1 2
pack years: 65.7 pack years: 69 pack years: 76 pack years: 77
1 1 1 1
pack years: 90
1
table(se$characteristics_ch1.5)
copd status: COPD copd status: No COPD
24 57
Another variable of interest could be max histology se$characteristics_ch1.6 which
represents the different histology types (tissue characteristics).
table(se$characteristics_ch1.6)
max histology: Hyperplasia max histology: Metaplasia
13 7
max histology: Mild dysplasia max histology: Moderate dysplasia
35 11
max histology: Normal max histology: Severe dysplasia
12 3
To facilitate handling this variables we are going to perform the following modifications:
se$copd_status <- gsub("copd status: ", "", se$characteristics_ch1.5)
se$copd_status <- factor(se$copd_status, levels = c("COPD", "No COPD"))
se$histology <- gsub("max histology: ", "", se$characteristics_ch1.6)
se$histology <- factor(se$histology, levels(se$histology) <- c("Hyperplasia", "Metaplasia", "Mild dysplasia", "Moderate dysplasia", "Normal", "Severe dysplasia"))
To faciliate working with the pack years variable will split it into four categorical groups - Low, Medium, High and Very high.
To do that we will:
1. Calculate quartiles for the data.
2. Assign categories based on quartile ranges.
3. Convert the values into their corresponding categories.
se$characteristics_ch1.4 <- gsub("pack years: ", "", se$characteristics_ch1.4)
class(se$characteristics_ch1.4)
[1] "character"
se$characteristics_ch1.4 <- as.numeric(se$characteristics_ch1.4)
quartiles <- quantile(se$characteristics_ch1.4, probs = c(0.25, 0.5, 0.75))
Now that we have the quartiles, we can proceed to assign categories to the values based on these quartiles.
Values below the first quartile (Q1) will be categorized as “low.” Values between Q1 and the median (Q2) will be categorized as “medium.” Values between the median (Q2) and the third quartile (Q3) will be categorized as “high.” Values above the third quartile (Q3) will be categorized as “very high.”
se$pack_years <- cut(se$characteristics_ch1.4,
breaks = c(-Inf, quartiles[1], quartiles[2], quartiles[3], Inf),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE)
In Table 1 below, we show the three variables of interest (histology, COPD status and pack years) jointly to gather as much understanding as possible on the underlying experimental design.
| Identifer | Pack Years | Histology | COPD Status |
|---|---|---|---|
| SRR3225255 | Very High | Mild dysplasia | No COPD |
| SRR3225256 | Low | Mild dysplasia | No COPD |
| SRR3225257 | Low | Moderate dysplasia | No COPD |
| SRR3225258 | Medium | Mild dysplasia | No COPD |
| SRR3225259 | Medium | Mild dysplasia | No COPD |
| SRR3225260 | Very High | Moderate dysplasia | COPD |
| SRR3225261 | High | Mild dysplasia | COPD |
| SRR3225262 | Very High | Moderate dysplasia | No COPD |
| SRR3225263 | Low | Mild dysplasia | COPD |
| SRR3225264 | High | Mild dysplasia | COPD |
| SRR3225265 | Medium | Moderate dysplasia | COPD |
| SRR3225266 | Low | Hyperplasia | No COPD |
| SRR3225267 | Medium | Hyperplasia | COPD |
| SRR3225268 | Very High | Mild dysplasia | No COPD |
| SRR3225269 | Very High | Mild dysplasia | COPD |
| SRR3225270 | Low | Hyperplasia | COPD |
| SRR3225271 | Medium | Mild dysplasia | No COPD |
| SRR3225272 | High | Normal | No COPD |
| SRR3225273 | Medium | Hyperplasia | COPD |
| SRR3225274 | Very High | Normal | No COPD |
| SRR3225275 | Low | Hyperplasia | No COPD |
| SRR3225276 | High | Mild dysplasia | COPD |
| SRR3225277 | Medium | Mild dysplasia | No COPD |
| SRR3225278 | Medium | Normal | No COPD |
| SRR3225279 | Very High | Hyperplasia | No COPD |
| SRR3225280 | Medium | Normal | No COPD |
| SRR3225281 | Very High | Hyperplasia | No COPD |
| SRR3225282 | Very High | Moderate dysplasia | No COPD |
| SRR3225283 | Low | Mild dysplasia | No COPD |
| SRR3225284 | High | Moderate dysplasia | COPD |
| SRR3225285 | Very High | Severe dysplasia | COPD |
| SRR3225286 | Medium | Normal | No COPD |
| SRR3225287 | Medium | Moderate dysplasia | No COPD |
| SRR3225288 | Medium | Moderate dysplasia | No COPD |
| SRR3225289 | Very High | Normal | No COPD |
| SRR3225290 | High | Normal | No COPD |
| SRR3225291 | Very High | Normal | COPD |
| SRR3225292 | Medium | Mild dysplasia | No COPD |
| SRR3225293 | Very High | Mild dysplasia | No COPD |
| SRR3225295 | Low | Mild dysplasia | No COPD |
| SRR3225296 | High | Mild dysplasia | COPD |
| SRR3225297 | Low | Normal | No COPD |
| SRR3225298 | Very High | Hyperplasia | No COPD |
| SRR3225299 | Low | Mild dysplasia | No COPD |
| SRR3225300 | High | Moderate dysplasia | No COPD |
| SRR3225301 | Medium | Normal | No COPD |
| SRR3225302 | High | Normal | No COPD |
| SRR3225303 | High | Mild dysplasia | No COPD |
| SRR3225304 | Medium | Hyperplasia | No COPD |
| SRR3225305 | Low | Mild dysplasia | No COPD |
| SRR3225306 | Low | Mild dysplasia | No COPD |
| SRR3225307 | High | Mild dysplasia | No COPD |
| SRR3225308 | High | Hyperplasia | No COPD |
| SRR3225309 | Low | Hyperplasia | No COPD |
| SRR3225310 | Very High | Mild dysplasia | COPD |
| SRR3225311 | Medium | Moderate dysplasia | COPD |
| SRR3225312 | High | Mild dysplasia | No COPD |
| SRR3225313 | Low | Mild dysplasia | No COPD |
| SRR3225314 | Low | Hyperplasia | No COPD |
| SRR3225315 | Medium | Moderate dysplasia | COPD |
| SRR3225316 | Low | Mild dysplasia | No COPD |
| SRR3225317 | High | Normal | No COPD |
| SRR3225318 | Low | Mild dysplasia | No COPD |
| SRR3225319 | High | Mild dysplasia | COPD |
| SRR3225320 | Very High | Hyperplasia | COPD |
| SRR3225321 | High | Severe dysplasia | No COPD |
| SRR3225322 | Very High | Mild dysplasia | No COPD |
| SRR3225323 | High | Mild dysplasia | COPD |
| SRR3225324 | Very High | Mild dysplasia | No COPD |
| SRR3225325 | Low | Mild dysplasia | No COPD |
| SRR3225326 | Medium | Mild dysplasia | COPD |
| SRR3225327 | High | Severe dysplasia | COPD |
| SRR3225328 | Very High | Metaplasia | COPD |
| SRR3225329 | High | Metaplasia | No COPD |
| SRR3225330 | Very High | Metaplasia | No COPD |
| SRR3225331 | Medium | Metaplasia | No COPD |
| SRR3225332 | Medium | Mild dysplasia | No COPD |
| SRR3225333 | Low | Metaplasia | No COPD |
| SRR3225334 | Low | Metaplasia | No COPD |
| SRR3225335 | High | Metaplasia | COPD |
| SRR3225336 | Low | Mild dysplasia | No COPD |
In the following step we will evaluate the sequencing depth in terms of the total number of read counts that are mapped to each sample’s genome. The sequencing depth per sample, commonly referred to as library sizes, is displayed in Figure 1 below in increasing order.
Figure 1: Library sizes in increasing order
The plot displays a range of sequencing depths for the samples, each represented by an SRR identifier. The sequencing depth across our samples ranges from 20 to 50 million reads, with the majority falling between 30 to 40 million reads. This indicates a generally high level of coverage, which is beneficial for accurate genomic analysis. The plot also suggests that the highest number of reads corresponds to mild and moderate dysplasia, indicating early stages of cellular abnormalities that could potentially progress to cancer. The predominance of mild dysplasia in these samples could be of particular interest for early detection and intervention studies. There does not appear to be any evident bias in terms of sequencing depth across the conditions represented.
Figure 2 below shows the distribution of expression values per sample in logarithmic CPM units of expression.
Figure 2: Non-parametric density distribution of expression profiles per sample
There are no substantial differences between the samples in the distribution of expression values.
Let’s calculate now the average expression per gene through all the samples. Figure 3 shows the distribution of those values across genes.
Figure 3: Distribution of average expression level per gene
The histogram displays a bimodal distribution, which means there are two distinct peaks. The large peak to the left is the most prominent feature of the histogram, showing that most genes in this dataset are expressed at low levels or possibly not at all, under the conditions of the experiment. On the right part, the peak which indicates that as gene expression levels increase, fewer genes are found at these higher expression levels.
In order to filter lowly-expressed genes, we calculate row means of expression data and then we apply logical masks. We filter out genes which do not meet certain criteria such as minimum expression levels and a sample threshold. This step is needed to refine the dataset and remove noise, ensuring the subsequent analysis is based on high-quality data. Figure 4 shows the distribution of those values across genes.
mask <- rowMeans(assays(se)$logCPM) > 1
se.filt <- se[mask, ]
dge.filt <- dge[mask, ]
dim(se.filt)
[1] 13374 81
cpmcutoff <- round(10/min(dge$sample$lib.size/1e6), digits=1)
cpmcutoff
[1] 0.4
nsamplescutoff <- 5
nsamplescutoff
[1] 5
mask <- avgexp >= 4 & avgexp <= 10
se.filt <- se[mask, ]
dge.filt <- dge[mask, ]
dim(se.filt)
[1] 8930 81
Figure 4: Distribution of lowly-expressed genes
We are left with 8930 genes.
We calculate now the normalization factors on the filtered expression data set.
dge.filt <- calcNormFactors(dge.filt)
Replace the raw log2 CPM units in the corresponding assay element of
the SummarizedExperiment object, by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE,
normalized.lib.sizes=TRUE)
We examine now the MA-plots of the normalized expression profiles in Figure ??
Figure 5: MA-plots of filtered and normalized expression values
Figure 6: MA-plots of filtered and normalized expression values
Figure 7: MA-plots of filtered and normalized expression values
Figure 8: MA-plots of filtered and normalized expression values
Figure 9: MA-plots of filtered and normalized expression values
For genes above the horizontal line at M=0, expression is higher in the condition being tested against the reference, as for genes below, expression is lower. The red line represents a loess fit to the data, which should ideally be around M=0 if there is no differential expression between conditions. Any systematic deviation from M=0 indicates potential bias or systematic variation in the data. In most of the plots, the red line remains close to M=0 across the range of A values, indicating no systematic bias in expression changes relative to gene abundance. Some plots show a slight dip in the loess line at the lower end of the A scale, suggesting a potential bias for lowly expressed genes which could be potential candidates for differential expression. These points would be of interest for follow-up analyses.
To identify potential columns which could introduce variability to batch, we looked at columns that contain information related to technical aspects, experimental conditions or processing details that could vaey between different batches. Here are some columns which we though could introduce unwanted variability: submission_date, last_update_date, type, channel_count, extract_protocol_ch1, extract_protocol_ch1.1, data_processing, data_processing.1, data_processing.2, data_processing.3, data_processing.4, data_processing.5, data_processing.6, platform_id, instrument_model, library_selection, library_source, library_strategy.
These variables have the same values across all samples which suggests that they are not contributing to the variability of the dataset. For example:
table(se.filt$histology, se.filt$library_strategy)
RNA-Seq
Hyperplasia 13
Metaplasia 7
Mild dysplasia 35
Moderate dysplasia 11
Normal 12
Severe dysplasia 3
There is perfect correlation between library strategy and histology. This means that differences between our samples will be due to biological differences rather than technical.
Now let’s examine the distribution between pack-years and histology.
table(se.filt$pack_years, se.filt$histology)
Hyperplasia Metaplasia Mild dysplasia Moderate dysplasia Normal
Low 5 2 12 1 1
Medium 3 1 7 5 4
High 1 2 9 2 4
Very High 4 2 7 3 3
Severe dysplasia
Low 0
Medium 0
High 2
Very High 1
We can see that not all combinations of pack-years and histology have been sequenced. Some combinations have missing values, indicated by 0s. However, with only two missing combinations out of the several possible combinations, the impact on the overall analysis may be minimal and not critical for the analysis.
We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating pack-years and histology. We calculate again log CPM values with a high prior count(3) to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 10.
Figure 10: Figure S6: Hierarchical clustering of the samples
Labels correspond to pack-years and sample identifer, while colors indicate histology groups.
Samples are linked together based on their similarity, with the branches connecting those most similar first. As we move up the tree, the connections represent progressively less similar groups until all are joined in a single cluster. Samples do not really cluster together by histology type, which means histology type is not the primary factor driving the similarity or dissimilarity between samples.
In Figure 11 we show the corresponding MDS plot.
Figure 11: Figure S7: Multidimensional scaling plot of the samples
Labels correspond to treatment and colors indicate sample group.
We can do an analysis to observe differential expression based on the two following groupings:
Histology
We can separate the samples’ Max. Histology in two groups: Normal vs. Dysplasia.
The Normal group will contain all samples that are Normal or have Hyperplasia.
The Dysplasia group will contain all samples with Dysplasia: Mild Dysplasia, Moderate dysplasia and Severe dysplasia
COPD
(Chronic obstructive pulmonary disease)
Samples with COPD vs. samples without COPD
In first instance, a simple way to find Differentially Expressed genes is to simply compare expression levels between the two groups of samples.
We can do this using fold-change, which compares the mean read counts using the logarithmic scale to try and stabilize variability across expression levels.
Our two groups of samples are : samples with COPD and samples with No COPD. We calculate the means of the values of the expression levels.
copd <- rowMeans(logCPM[, se.filt$copd_status=="COPD"])
no_copd <- rowMeans(logCPM[, se.filt$copd_status=="No COPD"])
Now we can build two plots to compare fold-change between the two groups.
par(mfrow=c(1, 2))
plot(copd, no_copd, xlab="COPD", ylab="No COPD", pch=".", cex=4, las=1)
plot((copd+no_copd)/2, copd-no_copd, xlab="(COPD + No COPD)/2", ylab="COPD - No COPD", pch=".", cex=4, las=1)
Our two groups of samples are No Dysplasia and Dysplasia. We create a new column in our SE to dichotomize the histology. We assign “Normal”, “Hyperplasia” and “Metaplasia” to the No Dysplasia group, and we assign “Mild dysplasia”, “Moderate dysplasia”, and “Severe dysplasia” to the Dysplasia group
se.filt$dysplasia[se.filt$histology == "Normal"] <- "No Dysplasia"
se.filt$dysplasia[se.filt$histology == "Hyperplasia"] <- "No Dysplasia"
se.filt$dysplasia[se.filt$histology == "Metaplasia"] <- "No Dysplasia"
se.filt$dysplasia[se.filt$histology == "Mild dysplasia"] <- "Dysplasia"
se.filt$dysplasia[se.filt$histology == "Moderate dysplasia"] <- "Dysplasia"
se.filt$dysplasia[se.filt$histology == "Severe dysplasia"] <- "Dysplasia"
#se.filt.dysplasia <- se.filt[, !is.na(se.filt$dysplasia)]
se.filt.dysplasia <- se.filt
no_dysplasia <- rowMeans(logCPM[, se.filt.dysplasia$dysplasia=="No Dysplasia"])
dysplasia <- rowMeans(logCPM[, se.filt.dysplasia$dysplasia=="Dysplasia"])
Now we can build two plots to compare fold-change between the two groups.
par(mfrow=c(1, 2))
plot(no_dysplasia, dysplasia, xlab="No Dysplasia", ylab="Dysplasia", pch=".", cex=4, las=1)
plot((no_dysplasia+dysplasia)/2, no_dysplasia-dysplasia, xlab="(No Dysplasia + Dysplasia)/2", ylab="No Dysplasia - Dysplasia", pch=".", cex=4, las=1)
We perform a simple assessment of the extent of expression changes and their associated p-values using the F-test implemented in the R/Bioconductor package sva. We compare two groups of samples in the following ways.
We compare samples that have COPD with samples that don’t have COPD. We have previously subsetted the data the way we needed.
We build now the corresponding full and null model matrices.
mod <- model.matrix(~ se.filt$copd_status, colData(se.filt))
mod0 <- model.matrix(~ 1, colData(se.filt))
Finally, we conduct the F-test implemented in the package sva and
examine the amount of differential expression between samples with COPD
and without.
library(sva)
library(limma)
pv <- f.pvalue(assays(se.filt)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.05)
[1] 8
We obtain 8 differentially expressed (DE) genes at FDR < 5%. In Figure 12 below we can see the distribution of the resulting p-values.
Figure 12: Distribution of raw p-values for an F-test on every gene between samples with COPD and samples without COPD
We build a table with the subset of 29 DE genes with FDR < 10% and show the top-10 genes with lowest p-value in Table 3 below.
mask <- p.adjust(pv, method="fdr") < 0.5
DEgenesEGs <- names(pv)[mask]
DEgenesSyms <- mcols(se.filt)[DEgenesEGs, "symbol"]
DEgenesPvalue <- pv[mask]
DEgenesDesc <- mcols(se.filt)[DEgenesEGs, "description"]
DEgenesDesc <- sub(" \\[.+\\]", "", DEgenesDesc)
DEgenesTab <- data.frame(EntrezID=DEgenesEGs,
Symbol=DEgenesSyms,
Description=DEgenesDesc,
"P value"=DEgenesPvalue,
stringsAsFactors=FALSE, check.names=FALSE)
DEgenesTab <- DEgenesTab[order(DEgenesTab[["P value"]]), ] ## order by p-value
rownames(DEgenesTab) <- 1:nrow(DEgenesTab)
| EntrezID | Symbol | Description | P value | |
|---|---|---|---|---|
| 1 | 9875 | URB1 | URB1 ribosome biogenesis 1 homolog (S. cerevisiae) | 3.00e-06 |
| 2 | 865 | CBFB | core-binding factor beta subunit | 9.10e-06 |
| 3 | 134553 | C5orf24 | chromosome 5 open reading frame 24 | 9.60e-06 |
| 4 | 2957 | GTF2A1 | general transcription factor IIA subunit 1 | 1.54e-05 |
| 5 | 6815 | STYX | serine/threonine/tyrosine interacting protein | 1.77e-05 |
| 6 | 26224 | FBXL3 | F-box and leucine rich repeat protein 3 | 2.01e-05 |
| 7 | 55432 | YOD1 | YOD1 deubiquitinase | 3.11e-05 |
| 8 | 7257 | TSNAX | translin associated factor X | 4.22e-05 |
| 9 | 7321 | UBE2D1 | ubiquitin conjugating enzyme E2 D1 | 5.27e-05 |
| 10 | 5532 | PPP3CB | protein phosphatase 3 catalytic subunit beta | 6.65e-05 |
We compare samples that have Dysplasia with samples that don’t have Dysplasia. We have previously subsetted the data the way we needed.
We build now the corresponding full and null model matrices.
mod <- model.matrix(~ se.filt.dysplasia$dysplasia, colData(se.filt.dysplasia))
mod0 <- model.matrix(~ 1, colData(se.filt.dysplasia))
Finally, we conduct the F-test implemented in the package sva and
examine the amount of differential expression between samples with
Dysplasia and samples with Hyperplasia/Normal samples.
library(sva)
library(limma)
pv <- f.pvalue(assays(se.filt.dysplasia)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.05)
[1] 2579
We obtain 2579 differentially expressed (DE) genes at FDR < 5% and 3399 at FDR < 10%. In Figure 13 below we can see the distribution of the resulting p-values.
[1] 2579
Figure 13: Distribution of raw p-values for an F-test on every gene between samples with Dysplasia and samples with Hyperplasia/Normal samples
We build a table with the subset of 2579 DE genes with FDR < 5% and show the top-10 genes with lowest p-value in Table 4 below.
mask <- p.adjust(pv, method="fdr") < 0.05
DEgenesEGs <- names(pv)[mask]
DEgenesSyms <- mcols(se.filt.dysplasia)[DEgenesEGs, "symbol"]
DEgenesPvalue <- pv[mask]
DEgenesDesc <- mcols(se.filt.dysplasia)[DEgenesEGs, "description"]
DEgenesDesc <- sub(" \\[.+\\]", "", DEgenesDesc)
DEgenesTab <- data.frame(EntrezID=DEgenesEGs,
Symbol=DEgenesSyms,
Description=DEgenesDesc,
"P value"=DEgenesPvalue,
stringsAsFactors=FALSE, check.names=FALSE)
DEgenesTab <- DEgenesTab[order(DEgenesTab[["P value"]]), ] ## order by p-value
rownames(DEgenesTab) <- 1:nrow(DEgenesTab)
| EntrezID | Symbol | Description | P value | |
|---|---|---|---|---|
| 1 | 29123 | ANKRD11 | ankyrin repeat domain 11 | 0e+00 |
| 2 | 26505 | CNNM3 | cyclin and CBS domain divalent metal cation transport mediator 3 | 0e+00 |
| 3 | 23334 | SZT2 | SZT2, KICSTOR complex subunit | 1e-07 |
| 4 | 5339 | PLEC | plectin | 2e-07 |
| 5 | 23524 | SRRM2 | serine/arginine repetitive matrix 2 | 3e-07 |
| 6 | 79364 | ZXDC | ZXD family zinc finger C | 3e-07 |
| 7 | 57154 | SMURF1 | SMAD specific E3 ubiquitin protein ligase 1 | 4e-07 |
| 8 | 9667 | SAFB2 | scaffold attachment factor B2 | 4e-07 |
| 9 | 4926 | NUMA1 | nuclear mitotic apparatus protein 1 | 5e-07 |
| 10 | 6170 | RPL39 | ribosomal protein L39 | 7e-07 |
Here we will do the functional analysis.
Here we discuss the findings.
Here we summarize our conclusions.
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Oslo
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] sva_3.50.0 BiocParallel_1.36.0
[3] genefilter_1.84.0 mgcv_1.9-1
[5] nlme_3.1-163 geneplotter_1.80.0
[7] annotate_1.80.0 XML_3.99-0.16.1
[9] AnnotationDbi_1.64.1 lattice_0.22-5
[11] edgeR_4.0.16 limma_3.58.1
[13] SummarizedExperiment_1.32.0 Biobase_2.62.0
[15] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
[17] IRanges_2.36.0 S4Vectors_0.40.2
[19] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
[21] matrixStats_1.3.0 usethis_2.2.3
[23] here_1.0.1 kableExtra_1.4.0
[25] knitr_1.46 BiocStyle_2.30.0
loaded via a namespace (and not attached):
[1] viridisLite_0.4.2 blob_1.2.4 Biostrings_2.70.3
[4] bitops_1.0-7 fastmap_1.1.1 RCurl_1.98-1.14
[7] digest_0.6.35 lifecycle_1.0.4 survival_3.5-8
[10] statmod_1.5.0 KEGGREST_1.42.0 RSQLite_2.3.6
[13] magrittr_2.0.3 compiler_4.3.3 rlang_1.1.3
[16] sass_0.4.9 tools_4.3.3 utf8_1.2.4
[19] yaml_2.3.8 S4Arrays_1.2.1 bit_4.0.5
[22] DelayedArray_0.28.0 xml2_1.3.6 RColorBrewer_1.1-3
[25] abind_1.4-5 KernSmooth_2.23-22 purrr_1.0.2
[28] grid_4.3.3 fansi_1.0.6 xtable_1.8-4
[31] colorspace_2.1-0 scales_1.3.0 tinytex_0.50
[34] cli_3.6.2 rmarkdown_2.26 crayon_1.5.2
[37] rstudioapi_0.16.0 httr_1.4.7 DBI_1.2.2
[40] cachem_1.0.8 stringr_1.5.1 splines_4.3.3
[43] zlibbioc_1.48.2 parallel_4.3.3 BiocManager_1.30.22
[46] XVector_0.42.0 vctrs_0.6.5 Matrix_1.6-5
[49] jsonlite_1.8.8 bookdown_0.39 bit64_4.0.5
[52] systemfonts_1.0.6 magick_2.8.3 locfit_1.5-9.9
[55] jquerylib_0.1.4 glue_1.7.0 codetools_0.2-19
[58] stringi_1.8.3 munsell_0.5.1 tibble_3.2.1
[61] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.11
[64] R6_2.5.1 rprojroot_2.0.4 evaluate_0.23
[67] highr_0.10 png_0.1-8 memoise_2.0.1
[70] bslib_0.7.0 Rcpp_1.0.12 svglite_2.1.3
[73] SparseArray_1.2.4 xfun_0.43 fs_1.6.3
[76] pkgconfig_2.0.3